home *** CD-ROM | disk | FTP | other *** search
Text File | 1999-09-24 | 25.5 KB | 550 lines | [TEXT/MPS ] |
- #=======================================================================
- # FTP file name: README.TXT
- #
- # Contents: Background information on Unicode mapping tables
- # for Mac OS text encodings
- #
- # Copyright: (c) 1995-1999 by Apple Computer, Inc., all rights
- # reserved.
- #
- # Contact: charsets@apple.com
- #
- # Changes:
- #
- # b02 1999-Sep-22 Update information on Cyrillic. Update
- # contact e-mail address.
- # n07 1998-Feb-05 Rewrite to provide additional information
- # relevant to using the accompanying mapping
- # tables, and to delete some extraneous
- # information. Delete Bulgarian (no special
- # encoding, uses standard Cyrillic), add
- # Farsi, Devanagari, Gurmukhi, Gujarati,
- # Celtic, Gaelic, Inuit, Tibetan.
- # n04 1995-Nov-15 Update info for Hebrew and Thai
- # n03 1995-Apr-15 First version (after fixing some typos).
- #
- ##################
-
- 0. Preliminaries
- ----------------
-
- For maximum interchangeability, this file and the accompanying Mac OS
- mapping tables use only ASCII characters. They are intended to be
- displayed in a monospaced font.
-
- Apple, the Apple logo, Mac, and Macintosh are trademarks of Apple
- Computer, Inc., registered in the United States and other countries.
- QuickDraw and TrueType are trademarks of Apple Computer, Inc. Unicode is
- a trademark of Unicode Inc. PostScript is a trademark of Adobe Systems
- Inc., which may be registered in certain jurisdictions. IBM is a
- registered trademark of International Business Machines Corporation. ITC
- Zapf Dingbats is a registered trademark of the International Typeface
- Corporation. For the sake of brevity, throughout this document and the
- accompanying tables, "Macintosh" can be used to refer to Macintosh
- computers and "Unicode" can be used to refer to the Unicode standard.
-
- Apple Computer, Inc. ("Apple") makes no warranty or representation,
- either express or implied, with respect to this document and the
- accompanying tables, their quality, accuracy, or fitness for a
- particular purpose. In no event will Apple be liable for direct,
- indirect, special, incidental, or consequential damages resulting from
- any defect or inaccuracy in this document or the accompanying tables.
-
- 1. Introduction
- ---------------
-
- This document summarizes some Unicode mapping considerations that are
- relevant for the accompanying mapping tables. It also provides an
- overview of Mac OS encodings.
-
- These mapping tables and character lists are subject to change.
- The latest tables should be available from the following:
-
- <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>
- <ftp://dev.apple.com/devworld/Technical_Documentation/Misc._Standards/>
-
- 2. Round-trip fidelity and overview of mapping techniques
- ---------------------------------------------------------
-
- For a particular set of national and international standards, Unicode
- provides round-trip fidelity: Text in one of those encodings can be
- mapped to Unicode and back again, yielding the original characters.
- Characters which are distinct in one of these source standards have
- a distinct counterpart in Unicode. Note that this counterpart might not
- be a single Unicode character; as is pointed out in "The Unicode
- Standard, Version 2.0" (page 2-10), "sometimes a single code value in
- another standard corresponds to a sequence of code values in the Unicode
- Standard, or vice versa."
-
- However, Unicode does not attempt to provide round-trip fidelity for
- most vendor standards. Nevertheless, Apple and other platform vendors
- may need to provide such round-trip fidelity for their current encodings
- (this can be important in file systems, for example). In order to do
- this, Apple makes use of some Unicode characters in the corporate-use
- zone (the upper end of the private use area).
-
- Corporate-zone characters must be used with care. Indiscriminate use of
- such characters can result in text which is not easily interchanged with
- other systems, since these characters have no standard meaning outside a
- particular platform. The mappings provided here are intended to minimize
- the use of private use characters, or to use them in such a way that
- basic text content will not be lost if the corporate zone characters are
- dropped when text is transferred to another system.
-
- The tables provided here have three goals, in the following order of
- importance:
- 1. Provide 100% round-trip mapping from a Mac OS encoding to Unicode
- and back (even if the mappings here are converted to maximal
- decompositions, see below).
- 2. Map characters in a Mac OS encoding into the Unicode characters
- that best represent the interpretation and usage of the Mac OS
- characters.
- 3. When mapping text in a Mac OS encoding to Unicode using the tables,
- the resulting Unicode text should be as interchangeable as possible.
-
- To satisfy these goals, the mappings use a variety of techniques. First
- we attempt to achieve round-trip mappings using any standard Unicode
- feature at our disposal, without resorting to corporate-zone characters.
- This can includes the following techniques:
- - Use of all Unicode characters defined in Unicode 2.1, including
- compatibility characters.
- - Mapping a single character in a Mac OS encoding to a sequence of
- standard Unicode characters, or vice versa. This requires grouping
- characters into appropriate chunks for lookup before mapping them
- (this mainly applies to sequences of Unicode characters).
- - Using Unicode direction overrides to force direction attributes when
- mapping to Unicode. This requires resolution of Unicode character
- direction, and use of this information, when mapping from Unicode back
- to certain Mac OS encodings.
- The requirements imposed on Unicode handling are necessary for other,
- non-transcoding operations in a full Unicode implementation anyway, so
- requiring them for transcoding should not impose much of a burden.
-
- Next, if round-trip fidelity cannot be achieved using the above
- techniques, we attempt to use corporate-zone characters only as
- "transcoding hints" (more on this below). These are combined with one or
- more standard Unicode characters to mark them as special for
- transcoding, but have no other function and can be deleted with no loss
- of basic text content (only of round-trip fidelity).
-
- Finally, if a character in a Mac OS encoding is unrelated to any Unicode
- or Unicode sequence, we may map it to a single corporate-zone Unicode
- code point.
-
- These techniques are described in more detail in the following sections.
-
- Some clients of these tables may have a different set of goals. For
- example, some clients may prefer to avoid compatibility characters,
- perhaps sacrificing round-trip fidelity if necessary. In most cases it
- is fairly easy to construct other types of mappings from the mappings
- given here. In particular, the mappings here have been designed so that
- if they are converted to maximal decomposition mappings (by recursive
- application of the canonical decompositions in the Unicode database),
- the resulting mappings will still provide 100% roundtrip fidelity.
-
- There is one more round-trip issue that should be mentioned. If a
- Unicode character or sequence can be mapped at all into a particular
- Mac encoding, then the reverse mapping back to Unicode should yield
- the original Unicode character or sequence (except for possible
- differences in direction overrides or other Unicode characters in the
- "Other, Format" category). The tables here also provide this. For a
- related issue, see the next section.
-
- 3. Mapping tolerance: Strict and loose
- --------------------------------------
-
- In many character sets, a single character may have multiple semantics,
- either by explicit definition, ambiguous definition, or established
- usage. For example, the JIS character 0x2142, or 0x8161 in Shift-JIS,
- is specified in the JIS X0208 standard to have two meanings: "double
- vertical line" and "parallel". Each of these meanings corresponds to a
- different Unicode character: 0x2016 DOUBLE VERTICAL LINE and 0x2225
- PARALLEL TO. When mapping from Unicode to Shift-JIS, it is normally
- desirable to map both of these Unicode characters to the single
- Shift-JIS character. However, when mapping the Shift-JIS character to
- Unicode, we can choose only one of the possible Unicode characters.
-
- For two encodings X and Y, we can define a set of "strict" mappings
- from one to the other as follows: If text in X can be mapped to Y using
- the strict mappings from X to Y, then the resulting text can be mapped
- back using the strict mappings from Y to X to end up with the original
- text from X. Similarly, if text in Y can be mapped to X using the strict
- mappings from Y to X, then the resulting text can be mapped back using
- the strict mappings from X to Y to end up with the original text from Y.
-
- There may be several characters in one encoding that all map to a
- single character in another encoding, but only one of these mappings
- can be strict; the others are "loose".
-
- The mappings given in the accompanying tables are strict mappings.
- However, the Mac OS Text Encoding Converter also supports loose
- mappings and fallback mappings. Some of the accompanying tables provide
- suggestions about possible loose mappings.
-
- 4. Mapping a Mac encoding character to a Unicode sequence or vice versa
- -----------------------------------------------------------------------
-
- In some cases, a character in a Mac OS encoding maps to a sequence of
- Unicode characters. For example, the Mac OS Japanese encoding includes
- a character for the circled CJK ideograph "big". Although Unicode
- encodes other circled ideographs as single characters, it does not
- encode this one. However, this character can be unambiguously
- represented in Unicode as the Unicode sequence 0x5927+0x20DD, the CJK
- ideograph for "big" followed by COMBINING ENCLOSING CIRCLE.
-
- To handle the reverse mapping, a transcoding process must group the
- Unicode sequence 0x5927+0x20DD as a single element for lookup (The
- Mac OS Text Encoding Converter does this).
-
- In a few cases, a sequence of characters in a Mac OS encoding must
- be grouped for mapping to a single Unicode character or a sequence
- of Unicode characters. For example, in Mac OS Devanagari (based on
- ISCII-91), DEVANAGARI LETTER VOCALIC L is represented as 0xA6+0xE9;
- but this is represented in Unicode by the single character 0x090C.
- Furthermore, explicit halant is represented in Mac OS Devanagari as
- 0xE8+0xE8 (double halant) and in Unicode as 0x094D+0x200C (VIRAMA
- plus ZERO WIDTH NON-JOINER). The latter can also be considered as
- a context-dependent mapping of 0xE8, halant.
-
- Loose mappings from Unicode to a Mac OS encoding often map a single
- Unicode to a sequence of characters in the Mac OS encoding. For example,
- the Unicode character 0x00BD VULGAR FRACTION ONE HALF cannot be mapped
- into the Mac OS Roman character set as a single character, but it has a
- loose mapping to the sequence 0x31+0xDA+0x32, "digit one" + "fraction
- slash" + "digit two".
-
- In some cases a Unicode character such as a direction override may
- simply be discarded when mapping to a Mac OS encoding, since the
- information carried by the override may be represented in a different
- way by the Mac OS encoding. See the next section for an example.
-
- 5. Mappings that depend on directionality (or other attributes)
- ---------------------------------------------------------------
-
- Strict mappings from Unicode to Mac OS encodings may depend on resolved
- character direction. Loose mappings may depend on additional attributes
- such as the state of symmetric swapping and whether the text should use
- vertical form codes if available (i.e. whether the text is intended for
- vertical display on a system that cannot automatically substitute
- vertical forms).
-
- a) Resolved character direction
-
- The Mac OS Arabic and Hebrew character sets were developed in 1986-1987.
- At that time the bidirectional line layout algorithm used in the Mac OS
- was fairly simple; it used only a few direction classes (instead of the
- 13 or so now used in the Unicode bidirectional algorithm). In order to
- permit users to handle some tricky layout problems, certain punctuation
- and symbol characters have duplicate code points, one with a left-right
- direction attribute and the other with a right-left direction attribute.
-
- For example, plus sign is encoded at 0x2B with a left-right attribute,
- and at 0xAB with a right-left attribute. However, there is only one PLUS
- SIGN character in Unicode. This leads to some interesting problems when
- mapping between Mac OS Arabic or Hebrew and Unicode.
-
- We need a way to map both of these plus signs to Unicode and back. Using
- a single corporate character for one of these plus signs is not a good
- solution, since both of the plus sign characters are likely to be used
- in text that is interchanged, and thus content would be lost.
-
- The problem is solved with the use of direction override characters and
- direction-dependent mappings. When mapping from Mac OS Arabic or Hebrew
- to Unicode, we use direction overrides as necessary to force the
- direction of the resulting Unicode characters. When mapping back from
- Unicode, the Unicode bidirectional algorithm should be used to determine
- resolved direction of the Unicode characters. The mapping from Unicode
- to Mac OS Arabic or Hebrew can then be disambiguated as necessary by
- using the resolved direction.
-
- For example, when mapping from Mac OS Arabic or Hebrew, we can use
- LEFT-RIGHT OVERRIDE (LRO), RIGHT-LEFT OVERRIDE (RLO), and POP DIRECTION
- FORMATTING (PDF) as follows:
-
- 0x2B -> 0x202D (LRO) + 0x002B (PLUS SIGN) + 0x202C (PDF)
- 0xAB -> 0x202E (RLO) + 0x002B (PLUS SIGN) + 0x202C (PDF)
-
- When mapping back, we resolve the direction of the Unicode character
- 0x002B, and use this information to determine which of the Mac OS
- encoding characters to use:
-
- 0x002B -> 0x2B (if LR) or 0xAB (if RL)
-
- After direction overrides have been used in this way to force a
- particular resolved direction, they may be discarded when mapping from
- Unicode to Mac OS Arabic and Hebrew (since the information they carried
- in Unicode is represented in the Mac OS encoding by the code point of
- the plus sign).
-
- Even when not required for round-trip fidelity, direction overrides
- may be used when mapping from a Mac OS encoding to Unicode in order to
- preserve proper text layout. For example, the single Mac OS Arabic
- ellipsis character has direction class right-left, while the Unicode
- HORIZONTAL ELLIPSIS character has direction class neutral. When
- mapping the Mac OS ellipsis to Unicode, it is surrounded with a
- direction override to help preserve proper text layout. However,
- resolved direction is not needed or used when mapping the Unicode
- HORIZONTAL ELLIPSIS back to Mac OS Arabic.
-
- b) Symmetric swapping
-
- In loose mappings from Unicode to the Mac OS Arabic character set, the
- state of symmetric swapping (which may be changed by the Unicode
- characters 0x206A, 0x206B) affects the mapping of paired characters such
- as punctuation and brackets. This does not affect the strict mappings
- given in the accompanying tables.
-
- c) Horizontal or vertical display
-
- The Mac OS Japanese encoding includes separately-encoded vertical forms
- for some punctuation and kana. When Unicode characters in the CJK
- punctuation and kana ranges are mapped to Mac OS Japanese characters and
- (1) those characters are intended for vertical display, (2) they will be
- displayed in an environment that does not provide automatic vertical
- form substitution, and (3) loose mappings are desired, the Unicode
- characters can be mapped to the corresponding vertical form codes in the
- Mac OS Japanese encoding.
-
- This does not affect mapping of the Unicode vertical presentation forms
- (which always map to the Mac OS Japanese vertical form codes).
-
- 6. Use of corporate characters
- ------------------------------
-
- Apple has defined a block of 32 corporate characters as "transcoding
- hints." These are used in combination with standard Unicode characters
- to force them to be treated in a special way for mapping to other
- encodings; they have no other effect. Sixteen of these transcoding
- hints are "grouping hints" - they indicate that the next 2-4 Unicode
- characters should be treated as a single entity for transcoding. The
- other sixteen transcoding hints are "variant tags" - they are like
- combining characters, and can follow a standard Unicode (or a sequence
- consisting of a base character and other combining characters) to
- cause it to be treated in a special way for transcoding. These always
- terminate a combining-character sequence.
-
- Whenever possible, mappings that require corporate-zone characters
- use standard Unicode characters in combination with a single
- transcoding hint (no mapping uses more than one transcoding hint).
- For these mappings, even if the corporate-zone characters are lost in
- interchange, the basic text content will be preserved.
-
- However, some characters in a Mac OS encoding - such as the Apple
- logo character - bear no relation to any standard Unicode character.
- In these cases, the Mac OS character is mapped to a single corporate
- zone character defined by Apple. Fewer than 15 corporate characters
- are used in this way.
-
- All of the corporate characters defined by Apple are listed in the
- accompanying file "CORPCHAR.TXT", including old Apple corporate
- character assignments which are now deprecated (but which are still
- supported as loose mappings by the Mac OS Text Encoding Converter).
-
- 7. Font variants
- ----------------
-
- For some Mac OS encodings, certain fonts used with that encoding may
- actually implement a slight variant of the standard encoding specified
- in the accompanying mapping tables. The header comments in the mapping
- table files for each encoding describe any font variants associated with
- that encoding.
-
- 8. Mac OS encodings
- -------------------
-
- The Mac OS can support multiple encodings. In the current Mac OS
- architecture these encodings are distinguished primarily by script code:
- font family IDs are grouped into ranges, and each range is associated
- with a script code.
-
- In some cases, there are several encodings that share a single script
- code. Usually these are closely related. To distinguish among these,
- additional information is required, such as font name or system
- region code (locale code).
-
- The encodings described here (and in the accompanying tables) are the
- encodings used in Mac OS versions 7.1 and later. In some cases, certain
- earlier system versions have used different encodings.
-
- In all Mac OS encodings, character codes 0x00-0x7F are identical to
- ASCII, except that
- - in Mac OS Japanese, reverse solidus is replaced by yen sign
- - in Mac OS Arabic, Farsi, and Hebrew, some of the punctuation in this
- range is treated as having strong left-right directionality,
- although the corresponding Unicode characters have neutral
- directionality
- Fonts used as "system" fonts (for menus, dialogs, etc.) have four glyphs
- at code points 0x11-0x14 for transient use by the Menu Manager. These
- glyphs are not intended as characters for use in normal text, and the
- associated code points are not generally interpreted as associated with
- these glyphs. (However, a "system font variant" mapping table could
- provide mappings for these).
-
- Note that in general, character sets cannot be determined from font
- layouts (they are not the same thing!). This is very noticeable with
- Arabic, Hebrew, and Devanagari, for example.
-
- The following is a list of current Mac OS encodings. The accompanying
- tables provide mappings from these encodings to Unicode 2.1.
-
- a) Mac OS encodings for script code 0, smRoman.
-
- * Roman - this is the default for script code 0 (when the special
- cases listed below do not apply). It covers several western European
- languages, and includes math operators and various symbols.
-
- * Symbol - this is the encoding for the font named "Symbol". It includes
- Greek letters, math operators, and miscellaneous symbols. The layout
- of the Symbol character set is identical to the layout of the Adobe
- Symbol encoding vector, with the addition of the Apple logo at 0xF0
- and the EURO SIGN at 0xA0.
-
- * Dingbats - this is the encoding for the font named "Zapf Dingbats".
- The layout of the Dingbats character set is identical to or a superset
- of the layout of the Adobe Zapf Dingbats encoding vector.
-
- * Turkish - this is the encoding if the script code is 0 and the system
- region code is 24, verTurkey. It has 7 code point differences from
- Mac OS Roman.
-
- * Croatian - this is the encoding if the script code is 0 and the system
- region code is any of the following:
- 68, verCroatia
- 66, verSlovenian
- 25, verYugoCroatian (only used in older systems)
- It has 20 code point differences from standard Roman, but only 10
- differences in repertoire.
-
- * Icelandic - this is the encoding if the script code is 0 and the
- system region code is either of the following:
- 21, verIceland
- 47, verFaroeIsl
- It has 6 code point differences from standard Roman. It also has one
- font variant.
-
- * Romanian - this is the encoding if the script code is 0 and the system
- region code is 39, verRomania . It has 6 code point differences from
- standard Roman.
-
- * Celtic - this is the encoding if the script code is 0 and the system
- region code is any of the following:
- 50, verIreland
- 75, verScottishGaelic
- 76, verManxGaelic
- 77, verBreton
- 79, verWelsh
- It is a variant of Mac OS Roman with a few extra accented characters
- for Welsh.
-
- * Gaelic - this is the encoding if the script code is 0 and the system
- region code is 81, verIrishGaelicScript. It is a variant of Mac OS
- Roman, and supports the older Irish orthography using dot above.
-
- * Greek (monotonic) - this is the encoding if the script code is 0 and
- the system region code is 20, verGreece. Although a script code is
- defined for Greek, the Greek localized system does not use it (the
- font family IDs are in the smRoman range). This encoding is based on
- the ISO/IEC 8859-7 repertoire with additional Roman characters for
- French and German, as well as additional symbols. Greek system 4.1
- used a different encoding that matched 8859-7 code points for Greek
- letters. Greek system 6.0.7 also used a variant of the standard
- encoding, but it was quickly replaced by Greek system 6.0.7.1 which
- used the standard encoding.
-
- See also the Central European encoding under script code 29 below.
-
- b) Mac OS encodings for script code 1, smJapanese.
-
- * Japanese - this is the default for script code 1. It is based on a
- Shift-JIS implementation of JIS X0208-1990 ("fullwidth") and
- JIS X0201-1976 ("halfwidth"), with 5 additional one-byte characters
- and one modified character, a set of Apple extension characters which
- include many industry standard extensions, and separate codes for
- vertical forms of some punctuation and kana. There are several font
- variants.
-
- c) Mac OS encodings for script code 2, smTradChinese.
-
- * Chinese Traditional - this is an extension of Big-5.
-
- d) Mac OS encodings for script code 3, smKorean.
-
- * Korean - this is an extension of EUC-KR.
-
- e) Mac OS encodings for script code 4, smArabic.
-
- * Arabic - This is the default for script code 4 (when the special
- case listed below does not apply). It is based on the ISO/IEC 8859-6
- repertoire, with additional Arabic letters for Persian and Urdu and
- with accented Roman letters for European languages. It has the
- interesting feature mentioned above that certain ASCII punctuation
- and symbol characters are encoded twice, once for each direction. It
- has several font variants.
-
- * Farsi - This is the encoding if the script code is 4 and the system
- region code is 48, verIran. It is similar to Mac OS Arabic, but has
- the "extended" or Persian digits instead of the standard Arabic
- digits. It has one font variant.
-
- f) Mac OS encodings for script code 5, smHebrew.
-
- * Hebrew - This is based on the ISO/IEC 8859-8 Hebrew letter repertoire,
- but adds Hebrew points, some Hebrew ligatures, some accented Roman
- letters for European languages, and some non-ASCII punctuation. As
- with Mac OS Arabic, certain ASCII punctuation and symbol characters
- are encoded twice, once for each direction. This is also true for the
- European digits. This has one font variant.
-
- g) Mac OS encodings for script code 6, smGreek.
-
- None currently - see smRoman.
-
- h) Mac OS encodings for script code 7, smCyrillic.
-
- * Cyrillic - This is based on the ISO/IEC 8859-5 Cyrillic character
- repertoire plus an additional case pair for Ukrainian.
-
- i) Mac OS encodings for script code 9, smDevanagari.
-
- * Devanagari - This is based on IS 13194:1991 (ISCII-91), and adds some
- punctuation and symbols.
-
- j) Mac OS encodings for script code 10, smGurmukhi.
-
- * Gurmukhi - This is based on IS 13194:1991 (ISCII-91), and adds some
- punctuation and symbols.
-
- k) Mac OS encodings for script code 11, smGujarati.
-
- * Gujarati - This is based on IS 13194:1991 (ISCII-91), and adds some
- punctuation and symbols.
-
- l) Mac OS encodings for script code 21, smThai.
-
- * Thai - This is based on TIS 620-2533, except that three of the
- TIS 620-2533 characters are replaced with other characters. Some
- undefined code points in TIS 620-2533 are used for additional
- punctuation characters.
-
- m) Mac OS encodings for script code 25, smSimpChinese.
-
- * Chinese Simplified - this is an extension of EUC-CN.
-
- n) Mac OS encodings for script code 26, smTibetan.
-
- * Tibetan
-
- o) Mac OS encodings for script code 28, smEthiopic.
-
- * Inuit - this is the encoding if the script code is 28 and the
- system region code is 78, verNunavut (for Inuktitut language).
- There is no script code for Inuit, so it shares the script code
- with Ethiopic.
-
- p) Mac OS encodings for script code 29, smCentralEuroRoman.
-
- * Central European - This is similar to standard Roman, but with a
- different (and larger) set of European characters and with fewer
- symbols. It is used for Polish, Czech, Slovak, Hungarian, Estonian,
- Latvian, and Lithuanian.
-